Abstract: This paper studies an optimal control problem
for continuous-time stochastic systems subject to reachability 
objectives specified in a subclass of metric interval temporal 
logic specifications, a temporal logic with real-time constraints. 
We propose a probabilistic method for synthesizing an optimal 
control policy that maximizes the probability of satisfying 
a specification based on a discrete approximation of the 
underlying stochastic system. First, we show that the original 
problem can be formulated as a stochastic optimal control 
problem in a state space augmented with finite memory and 
states of some clock variables. Second, we present a numerical 
method for computing an optimal policy with which the 
given specification is satisfied with the maximal probability 
in point-based semantics in the discrete approximation of 
the underlying system. We show that the policy obtained in 
the discrete approximation converges to the optimal one for 
satisfying the specification in the continuous or dense-time 
semantics as the discretization becomes finer in both state and 
time. Finally, we illustrate our approach with a robotic motion 
planning example. 

I. Introduction 

Stochastic optimal control is an important research area for analysis and control design for continuous-time dynamical systems that operate in the presence of uncertainty. However, existing stochastic control methods cannot be readily applied to handle complex temporal logic specifications with real-time constraints, which are of growing interest to the design of autonomous and semiautonomous systems [1]-[4]. In this paper, we propose a numerical method for stochastic optimal control with respect to a subclass of metric temporal logic specifications. In particular, given a specification encoding desirable properties of a continuous-time stochastic system, the task is to synthesize a control policy such that if the system implements the policy, then the probability of a path satisfying the formula is maximized.

Metric temporal logic (MTL) is one of many real-time logics that express not only the relative temporal ordering of events, as linear temporal logic (LTL) does, but also the duration between these events. For example, a surveillance task of a mobile robot, revisiting regions 1 and 2 infinitely often, can be expressed in LTL. But tasks with quantitative timing constraints, for instance, visiting region 2 within 5 minutes after visiting region 1, require the expressive power of MTL. For system specifications in LTL and its untimed variants, methods have been developed for quantitative verification of discrete-time stochastic hybrid systems [5], [6], and control

J. Fu and U. Topcu are with the Department of Electrical and Systems 
Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA 
jief, utopcu@seas.upenn.edu. 


design of continuous-time and discrete-time linear stochastic systems [7], [8]. For MTL and its variants, a specification-guided testing framework is proposed in [9] for verification of stochastic cyber-physical systems. Reference [10] proposes a solution to the vehicle routing problem with respect to MTL specifications. Reference [11] develops an abstraction technique and a method of transforming MTL formulas to LTL formulas. As a result, existing synthesis methods for discrete deterministic systems with LTL constraints can be applied to design switching protocols for continuous-time deterministic systems in dynamic environments subject to MTL constraints. Reference [12] proposes a reactive synthesis method for non-deterministic systems that maximizes the robustness of satisfying a specification in signal temporal logic, which is a subclass of MTL. The robustness of a path is measured by the distance between this path and the set of paths that satisfy the specification. Our work differs from existing ones in both the problem formulation and the control objective. We deal with systems with stochastic dynamics, rather than non-deterministic systems [11], [12]. We consider reachability objectives specified in metric interval temporal logic (MITL), which is a subclass of MTL. The optimality of control design is evaluated by the probability of satisfying the given specification. The synthesis method is with respect to quantitative criteria (the probability of satisfying the formula), not qualitative criteria (whether the formula is satisfied).

Our solution approach utilizes the Markov chain approximation method [13] to generate a discrete abstraction in the form of a Markov decision process (MDP) approximating the continuous-time stochastic system. Based on a product operation between the discrete abstraction and a finite-state automaton that represents the desirable system property, a near-optimal policy with respect to the probability of satisfying the formula in the point-based semantics of MITL [14] can be computed by solving an optimal planning problem in the MDP. We show that as the discretization gets finer in both state space and time, the optimal control policy in the abstract system converges to the optimal one in the original stochastic system with respect to the probability of satisfying the MITL formula in the continuous or dense-time semantics [15].

II. Preliminaries and problem formulation
A. The system model and timed behaviors 

We study stochastic dynamical systems in continuous time. The state of the system evolves according to the stochastic differential equation (SDE)

$$\begin{cases} dx(t) = f(x(t), u(t))\,dt + g(x(t))\,dw, \\ x(0) = x_0, \end{cases} \qquad (1)$$

where $f : X \times U \to \mathbb{R}^n$ and $g : X \to \mathbb{R}^{n \times k}$ are continuous and bounded functions given $X$ and $U$ as compact state and input space; $w(\cdot)$ is an $\mathbb{R}^k$-valued, $\mathcal{F}_t$-Wiener process which serves as a "driving noise" and is defined on the probability space $(\Omega, \mathcal{F}, P)$; $x(\cdot)$ is an $X$-valued, $\mathcal{F}_t$-adapted, measurable process also defined on $(\Omega, \mathcal{F}, P)$; and $u(\cdot)$ is an admissible control law, i.e., a $U$-valued, $\mathcal{F}_t$-adapted, measurable process defined on $(\Omega, \mathcal{F}, P)$. We say $x(\cdot), u(\cdot)$ solve the SDE in (1) provided that

$$x(t) = x(0) + \int_0^t f(x(\tau), u(\tau))\,d\tau + \int_0^t g(x(\tau))\,dw(\tau), \qquad (2)$$

holds for all time $t \ge 0$.
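As an illustration of how sample paths of an SDE of this form can be generated numerically, the following is a minimal sketch using the Euler-Maruyama scheme. This scheme is not part of the method in this paper; it is a standard simulation technique swapped in for illustration. The function name `euler_maruyama`, its arguments, and the feedback form `u(t, x)` of the control law are assumptions, and the compactness of the state space is ignored.

```python
import math
import random


def euler_maruyama(f, g, u, x0, dt, n_steps, rng=None):
    """Simulate dx = f(x, u) dt + g(x) dw on a uniform time grid.

    f(x, a) returns a list of n drift components, g(x) returns an
    n-by-k matrix as a list of rows, and u(t, x) returns the input.
    All of these names are hypothetical helpers for this sketch.
    """
    rng = rng or random.Random(0)
    x = list(x0)
    path = [list(x)]
    for step in range(n_steps):
        t = step * dt
        a = u(t, x)
        drift = f(x, a)
        G = g(x)
        k = len(G[0])
        # Wiener increments: independent zero-mean Gaussians with variance dt.
        dw = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(k)]
        x = [
            x[i] + drift[i] * dt + sum(G[i][j] * dw[j] for j in range(k))
            for i in range(len(x))
        ]
        path.append(list(x))
    return path
```

With a zero diffusion term the scheme reduces to the forward Euler method for the drift alone, which gives a simple sanity check.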

We introduce a labeling function that relates a sample path of the SDE in (1) to a timed behavior. Let $\mathcal{AP}$ be a finite set of atomic propositions and $L : X \to 2^{\mathcal{AP}}$ be a labeling function that maps each state $x \in X$ to a set of atomic propositions that evaluate true at that state.

A time interval $I$ is a convex set $\langle t, t' \rangle$ where $t, t' \in \mathbb{R}_{\ge 0}$, the symbol '$\langle$' can be one of '(' and '[', the symbol '$\rangle$' can be one of ')' and ']', and $t \le t'$. For a time interval of the above form, $t$ and $t'$ are the left and right end-points, respectively. A time interval is empty if it contains no point. A time interval is singular if $t = t'$ and it contains exactly one point.

Definition 1: [16] A dense-time behavior over an infinite-time domain $[0, \infty)$ and a set $\mathcal{AP}$ of atomic propositions is a function $b : [0, \infty) \to 2^{\mathcal{AP}}$ which maps every time instant $t \ge 0$ to a set $b(t) \in 2^{\mathcal{AP}}$ of atomic propositions that hold at $t$.

Given a continuous sample path $x(\cdot, \omega)$, $\omega \in \Omega$, of the stochastic process $x(\cdot)$, the timed behavior $b$ of this sample path is $b(t) = L(x(t, \omega))$, for all $t \ge 0$.

Definition 2: [16] Let $b$ be a dense-time behavior and $\delta \in \mathbb{R}_{>0}$ be a positive real number, referred to as the sampling interval. The canonical sampling $b^\delta = \{b^\delta_n, n \in \mathbb{Z}_{\ge 0}\}$ of the timed behavior $b$ is defined such that $b^\delta_n = b(n\delta)$.
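The canonical sampling can be illustrated with a short sketch. Everything in the example is hypothetical: the path $x(t) = t$ on the real line, the single region $R_1 = [2, 3]$, and the helper names `canonical_sampling` and `behavior`. The point is only that a coarse sampling interval can miss a region that the dense-time behavior visits.

```python
def canonical_sampling(b, delta, n_samples):
    """Return [b(0), b(delta), ..., b((n_samples - 1) * delta)]."""
    return [b(n * delta) for n in range(n_samples)]


def behavior(t):
    """Timed behavior b(t) = L(x(t)) for the hypothetical path x(t) = t."""
    x = t
    return frozenset({"R1"}) if 2.0 <= x <= 3.0 else frozenset()


fine = canonical_sampling(behavior, 0.5, 8)    # samples at 0, 0.5, ..., 3.5
coarse = canonical_sampling(behavior, 4.0, 3)  # samples at 0, 4, 8

# Indices of the samples at which R1 holds.
hits_fine = [n for n, label in enumerate(fine) if "R1" in label]
hits_coarse = [n for n, label in enumerate(coarse) if "R1" in label]
```

Here the fine sampling observes `R1` at three sampling instants, while the coarse sampling never observes it.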


B. Specifications 

We introduce metric interval temporal logic [15], a subclass of MTL, to express system specifications.

Definition 3 (Metric interval temporal logic): Given a set $\mathcal{AP}$ of atomic propositions, the formulas of MITL are built from $\mathcal{AP}$ by Boolean connectives and time-constrained versions of the until operator $U$ as follows.

$$\varphi := \top \mid \bot \mid \varphi_1 \wedge \varphi_2 \mid \neg\varphi \mid p \mid \varphi_1 U_I \varphi_2$$

where $p \in \mathcal{AP}$, $I$ is a nonsingular time interval with integer end-points, and $\top$, $\bot$ are unconditional true and false, respectively.


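Dense-time semantics of MITL: Given a timed behavior $b$, we define $b, t \models \varphi$ with respect to an MITL formula $\varphi$ at time $t$ inductively as follows:

• $b, t \models p$ where $p \in \mathcal{AP}$ if and only if $p \in b(t)$;

• $b, t \models \neg\varphi$ if and only if $b, t \not\models \varphi$;

• $b, t \models \varphi_1 \wedge \varphi_2$ if and only if $b, t \models \varphi_1$ and $b, t \models \varphi_2$;

• $b, t \models \varphi_1 U_I \varphi_2$ if and only if there exists $t' \in I$ such that $b, t + t' \models \varphi_2$ and for all $t'' \in [0, t')$, $b, t + t'' \models \varphi_1$.

We write $b \models \varphi$ if $b, 0 \models \varphi$. We also define the temporal operators $\Diamond_I \varphi = \top U_I \varphi$ (eventually, $\varphi$ will hold within interval $I$ from now) and $\Box_I \varphi = \neg(\Diamond_I \neg\varphi)$ (for all points within $I$, $\varphi$ holds).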

C. Timed automata 

An MITL formula can be translated into an equivalent non-deterministic timed automaton [15]. We consider a fragment of MITL which can be translated into an equivalent deterministic timed automaton.

Let $\Sigma$ be a finite alphabet. $\Sigma^*$, $\Sigma^\omega$ are the sets of finite and infinite words (sequences of symbols) over $\Sigma$. An (infinite) timed word [17] over $\Sigma$ is a pair $w = (\tau, \sigma)$, where $\sigma = \sigma_0\sigma_1 \ldots \in \Sigma^\omega$ is an infinite word and $\tau = \tau_0\tau_1 \ldots$ is an infinite timed sequence, which satisfies 1) Initialization: $\tau_0 = 0$; 2) Monotonicity: $\tau$ increases strictly monotonically, i.e., $\tau_i < \tau_{i+1}$ for all $i \ge 0$; 3) Progress: For every $n \ge 1$ and $\tau_0 < t < \tau_n$, there exists some $i \ge 0$ such that $\tau_i > t$. The conditions ensure that there are finitely many symbols (events) in a bounded time interval, known as non-Zenoness. We also write $w = (\tau, \sigma) = (\tau_0, \sigma_0)(\tau_1, \sigma_1) \ldots$

Before the introduction of timed automata, we introduce clocks and clock constraints: Let $C$ be a finite set of clocks, $C = \{c_1, c_2, \ldots, c_M\}$. We define a set $\Phi_C$ of clock constraints over $C$ in the following manner. Let $k \in \mathbb{N}$ be a non-negative integer, and $\bowtie \in \{=, \ne, <, >, \ge, \le\}$ be a comparison operator,

$$\phi := \top \mid \bot \mid c \bowtie k \mid c - c' \bowtie k \mid \phi_1 \wedge \phi_2 \mid \phi_1 \vee \phi_2,$$

where $c, c' \in C$ are clocks.

Definition 4: [17] A deterministic timed automaton is a tuple $\mathcal{A} = (Q, 2^{\mathcal{AP}}, \mathrm{Init}, F, C, T)$ where $Q$ is a finite set of states, $2^{\mathcal{AP}}$ is a finite alphabet built from the set $\mathcal{AP}$ of atomic propositions, $\mathrm{Init}$ is the initial state, $F$ is a finite set of accepting states, and $C$ is a finite set of clocks. The transition function $T : Q \times 2^{\mathcal{AP}} \times \Phi_C \to Q \times 2^C$ is deterministic and interpreted as follows: If $T(q, a, \phi) = (q', C')$ then $\mathcal{A}$ allows a transition from $q$ to $q'$ when the set $a \in 2^{\mathcal{AP}}$ of atomic propositions evaluates true and the clock constraint $\phi \in \Phi_C$ is met. After taking this transition, the clocks in $C' \subseteq C$ are reset to zero, while other clocks remain unchanged.

For each clock $c_i \in C$, we denote by $V_i$ the range of that clock. For notational convenience, we define a clock vector $v \in \mathbb{R}^M_{\ge 0}$ where the $i$-th entry $v[i]$ of the clock vector $v$ is the value of clock $c_i$, for $i \in \{1, 2, \ldots, M\}$. Given $t \in \mathbb{R}_{\ge 0}$, let $v \oplus t = (v[1] + t, v[2] + t, \ldots, v[M] + t)$. We use $\mathbf{0}$ for the clock vector $v$ where $v[i] = 0$ for all $i \in \{1, 2, \ldots, M\}$ and $V = \prod_{i = 1, \ldots, M} V_i$ for the set of all possible clock vectors in $\mathcal{A}$. Note that a clock vector is essentially a clock valuation defined in [17].

A configuration of $\mathcal{A}$ is a pair $(q, v)$ where $q$ is a state and $v$ is a clock vector. A transition $T(q, a, \phi) = (q', C')$ being taken from the configuration $(q, v)$ after $\delta$ time units is also written as $(q, v) \xrightarrow{\delta, a} (q', v')$ where $v \oplus \delta \models \phi$ and $v'[i] = v[i] + \delta$ if $c_i \notin C'$, otherwise $v'[i] = 0$.

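A run in $\mathcal{A}$ on a timed word $w = (\tau_0, \sigma_0)(\tau_1, \sigma_1) \ldots$ is an infinite alternating sequence of configurations and delayed transitions

$$\rho = (\mathrm{Init}, \mathbf{0}) \xrightarrow{\Delta\tau_0, \sigma_0} (q_0, v_0) \xrightarrow{\Delta\tau_1, \sigma_1} (q_1, v_1) \ldots,$$

with $\Delta\tau_0 = \tau_0$ and $\Delta\tau_i = \tau_i - \tau_{i-1}$ for $i \ge 1$, subject to the following conditions:

1) There exist $C_0 \subseteq C$ and $\phi_0 \in \Phi_C$ such that $\mathbf{0} \oplus \tau_0 \models \phi_0$, $T(\mathrm{Init}, \sigma_0, \phi_0) = (q_0, C_0)$, $v_0[i] = \tau_0$ for all $c_i \notin C_0$, and $v_0[i] = 0$ for all $c_i \in C_0$.

2) For each $i \ge 0$, there exist $C_{i+1} \subseteq C$ and $\phi_{i+1} \in \Phi_C$ such that $v_i \oplus \Delta\tau_{i+1}$ satisfies the clock constraint $\phi_{i+1}$, $T(q_i, \sigma_{i+1}, \phi_{i+1}) = (q_{i+1}, C_{i+1})$ is defined, $v_{i+1}[k] = v_i[k] + \Delta\tau_{i+1}$ for all $c_k \notin C_{i+1}$, and $v_{i+1}[k] = 0$ for all $c_k \in C_{i+1}$.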

We consider reachability objectives: A run $\rho$ on a timed word $w$ is accepting if and only if $\mathrm{Occ}(\rho) \cap F \ne \emptyset$ where $\mathrm{Occ}(\rho)$ is the set of states in $Q$ occurring in $\rho$. The set of timed words on which runs are accepted by $\mathcal{A}$ is called the language of $\mathcal{A}$, denoted $\mathcal{L}(\mathcal{A})$.

Example 1: As a simple example of timed automata, let $\mathcal{AP} = \{R_1\}$ and the specification formula be $\varphi = \Diamond_{[3,5]} R_1$. The reachability specification can be expressed with a deterministic timed automaton $\mathcal{A}_\varphi$ in Figure 1. The set of final states is $F = \{q_1\}$. The timed automaton accepts a timed word with a prefix $(0, \{\neg R_1\})(3.5, \{R_1\})$, i.e., $w = (0, \{\neg R_1\})(3.5, \{R_1\}) \ldots$, since $(\mathrm{Init}, 0) \xrightarrow{0, \{\neg R_1\}} (\mathrm{Init}, 0) \xrightarrow{3.5, \{R_1\}} (q_1, 0)$ and $q_1$ is accepting. It does not accept $w = (0, \{\neg R_1\})(2.8, \{R_1\}) \ldots$, $w = (\tau, \{\neg R_1\}^\omega)$ for an arbitrary timed sequence $\tau$, or $w = (0, \{\neg R_1\})(6, \{R_1\}) \ldots$ because either $R_1$ is evaluated true when $c < 3$ or $c > 5$, or it is never true over an infinite timed sequence.


Fig. 1: Timed automaton $\mathcal{A}_\varphi$ for $\varphi = \Diamond_{[3,5]} R_1$. Only one clock $c$ is used. A transition labeled $(a, \phi_c)$ is taken if and only if both $\phi_c$ and $\{a\}$ evaluate true. A transition labeled $(a, \phi_c, c := 0)$ is taken if and only if both $\phi_c$ and $\{a\}$ evaluate true and, along with taking the transition, the clock $c$ is reset to 0.
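The acceptance check of Example 1 can be sketched in code. The transition table below is an assumption read off the example text, not a transcription of the figure: from `Init`, reading $R_1$ with $3 \le c \le 5$ leads to `q1` and resets the clock, reading a symbol without $R_1$ with $c \le 5$ stays in `Init`, and any other input blocks the run. A symbol is represented as the set of propositions that hold, so the text's $\{\neg R_1\}$ becomes the empty set. Only finite prefixes of timed words are checked, which suffices for a reachability objective.

```python
# Each entry: (label predicate, clock guard, target state, reset clock?).
TRANSITIONS = {
    "Init": [
        (lambda s: "R1" in s, lambda c: 3.0 <= c <= 5.0, "q1", True),
        (lambda s: "R1" not in s, lambda c: c <= 5.0, "Init", False),
    ],
    "q1": [(lambda s: True, lambda c: True, "q1", True)],
}


def run_accepts(word, transitions=TRANSITIONS, init="Init", accepting=("q1",)):
    """Return True if the finite timed-word prefix reaches an accepting state.

    word is a list of (time, symbol) pairs with strictly increasing times,
    starting at time 0. A single clock is assumed.
    """
    state, clock, prev_time = init, 0.0, 0.0
    for time, symbol in word:
        clock += time - prev_time  # let time elapse before the transition
        prev_time = time
        for label_ok, guard, target, reset in transitions.get(state, []):
            if label_ok(symbol) and guard(clock):
                state = target
                if reset:
                    clock = 0.0
                break
        else:
            return False  # no enabled transition: the run is blocked
        if state in accepting:
            return True
    return False


NOT_R1, R1 = frozenset(), frozenset({"R1"})
accepted = run_accepts([(0.0, NOT_R1), (3.5, R1)])
too_early = run_accepts([(0.0, NOT_R1), (2.8, R1)])
too_late = run_accepts([(0.0, NOT_R1), (6.0, R1)])
```

Under these assumptions the three words behave as in the example: the first is accepted and the other two are rejected.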


D. Problem formulation 

Given a sampling interval $\delta$ and a timed behavior $b : [0, \infty) \to 2^{\mathcal{AP}}$, we map the canonical sampling of $b$ to a timed word $T(b^\delta) = w = (0, \sigma_0)(\delta, \sigma_1) \ldots$ such that for any $i \ge 0$, $\sigma_i = b(i\delta)$. We say that the timed behavior $b$ satisfies the formula $\varphi$ in the point-based semantics under the sampling interval $\delta$, denoted $b^\delta \models \varphi$, if and only if $T(b^\delta)$ is accepted in the timed automaton $\mathcal{A}_\varphi$ that expresses $\varphi$. The sampling interval $\delta$ determines a sequence of positions (time instants) $0, \delta, 2\delta, \ldots$ in the timed behavior. With $\delta$ being a positive infinitesimal, any position in a timed behavior $b$ appears in the timed sequence of the timed word $T(b^\delta)$. Thus, we say that the timed behavior $b$ satisfies $\varphi$ in the continuous or dense-time semantics, i.e., $b \models \varphi$, if and only if $\lim_{\delta \to 0} T(b^\delta) \in \mathcal{L}(\mathcal{A}_\varphi)$. A formal definition of satisfiability of MTL formulas over dense-time and point-based semantics is given in [18] and the relation between these two semantics has been studied in [16].

We say that a sample path of the SDE in (1) satisfies an MITL formula $\varphi$ in the dense-time semantics (resp. point-based semantics under the sampling interval $\delta$) if its timed behavior satisfies $\varphi$ in the dense-time semantics (resp. point-based semantics under the sampling interval $\delta$). Formally, let $x(\cdot, \omega)$ where $\omega \in \Omega$ be a sample path of the stochastic process $\{x(t), t \ge 0\}$. We have that $\lim_{\delta \to 0} [L(x(\cdot, \omega))]^\delta \models \varphi$ is equivalent to $L(x(\cdot, \omega)) \models \varphi$.

Given a stochastic process $x(\cdot)$ and an admissible control law $u(\cdot)$ that solve the SDE in (1), the probability of satisfying a formula $\varphi$ in the system under the control law $u(\cdot)$ is the sum of probabilities of continuous sample paths of $x(\cdot)$ that satisfy the formula $\varphi$ in the dense-time or point-based semantics (with respect to a given sampling interval).

Problem 1: Given an SDE in (1) and a timed automaton $\mathcal{A}_\varphi = (Q, 2^{\mathcal{AP}}, \mathrm{Init}, F, C, T)$ expressing an MITL formula $\varphi$, compute a control input $u(\cdot)$ that maximizes the probability of satisfying $\varphi$ in the dense-time semantics.

III. Main result 

In this section, we first show that for the SDE in (1), Problem 1 can be formulated as a stochastic optimal control problem in a system derived from the SDE with an augmented state space for capturing relevant properties with respect to its MITL specification. Then, we introduce a numerical scheme that computes an optimal policy in a discrete approximation of the SDE in (1) with respect to the probability of satisfying the specification in the point-based semantics. The numerical scheme is based on the so-called Markov chain approximation method [13]. We prove that such a policy converges to a solution to Problem 1 as the discretization gets finer.

We make two assumptions. 

Assumption 1: The state space $X$ and the clock vector space $V$ are bounded.

This condition ensures a finite number of states in the discrete approximation. In certain cases, we might also require $U$ to be bounded in order to approximate the input space with a finite set.

Assumption 2: $f(\cdot)$ and $g(\cdot)$ are bounded, continuous, and Lipschitz continuous in state $x$, while $f(\cdot)$ is uniformly so in $u$.

Assumption 2 ensures that the SDE in (1) has a unique solution for a given controller $u(\cdot)$.

A. Characterizing the reachability probability 

For reachability objectives in MITL, Problem 1 is also referred to as the probabilistic reachability problem.

A state in $X \times Q \times V$ is called a product state, following from the fact that it is a state in a product construction between the stochastic process for the controlled stochastic system and the timed automaton expressing the specification. We define a projection $\pi_i$ such that for a given tuple $s$, $\pi_i(s)$ is the $i$-th element in the tuple. The projection $\pi_i$ is extended to sequences of tuples in the usual way: $\pi_i(s\rho) = \pi_i(s)\pi_i(\rho)$ where $s$ is a tuple and $\rho$ is a sequence of tuples.

Let $S = X \times Q \times V$. For a stochastic process $\{x(t), t \ge 0\}$, we derive a product stochastic process $\{s(t), t \ge 0\}$ where $s(t) = (x(t), q(t), v(t))$ is a random variable describing the product state. The process $\{s(t), t \ge 0\}$ satisfies the following conditions.

• $s(0) = (x(0), q(0), v(0))$ where $v(0) = \mathbf{0}$ and $(\mathrm{Init}, \mathbf{0}) \xrightarrow{0, L(x(0))} (q(0), \mathbf{0})$.

• For any time $\tau \in [0, \infty)$, let $\delta = \inf\{t \mid \exists \phi \in \Phi_C,\ v(\tau) \oplus t \models \phi \text{ and } T \text{ is defined for } (q(\tau), L(x(\tau + t)), \phi)\}$. If $T(q(\tau), L(x(\tau + \delta)), \phi) = (q', C')$, then let $q(\tau + \delta) = q'$, $v(\tau + \delta)[i] = v(\tau)[i] + \delta$ for $c_i \notin C'$ and $v(\tau + \delta)[i] = 0$ for $c_i \in C'$. Moreover, for all $\tau \le t < \tau + \delta$, $q(t) = q(\tau)$ and $v(t) = v(\tau) \oplus t$.

Alternatively, given a sample path $x(\cdot, \omega)$, $\omega \in \Omega$, suppose that at time $\tau$ the configuration in $\mathcal{A}_\varphi$ is $(q, v)$, the labeling $L(x(\tau + \delta, \omega))$ and the clock vector $v \oplus \delta$ trigger a transition precisely at time $\tau + \delta$, and during the interval $[\tau, \tau + \delta)$ no transition is triggered. Then, the configuration in $\mathcal{A}_\varphi$ changes from $(q, v)$ to $(q', v')$ also at time $\tau + \delta$ provided that $(q, v) \xrightarrow{\delta, L(x(\tau + \delta, \omega))} (q', v')$. Moreover, for any time $t$ during the time interval $\tau \le t < \tau + \delta$, the state in the specification automaton remains $q$ and each clock value increases with the elapsed time.

For a measurable function $f$ that maps sample paths in the process $s(\cdot)$ into reals, we write $E_s^u(f)$ for the expected value of $f$ when the initial state is $s(0) = s$.

The following lemma is an immediate consequence of the derivation procedure for the product stochastic process.

Lemma 1: Given a set $G = X \times F \times V$, let $P_x(\varphi)$ denote the probability of a sample path in the stochastic process $\{x(t), t \ge 0\}$ starting from $x(0) = x$ and satisfying $\varphi$ in the dense-time semantics, and let $P_s(G)$ be the probability of reaching the set $G$ in the derived product stochastic process $\{s(t), t \ge 0\}$ with $s(0) = s$. It holds that $P_x(\varphi) = P_s(G)$.

By Lemma 1, we can define a value function in the product stochastic process to characterize the probability of satisfying $\varphi$ in the dense-time semantics.


Dense-time reachability probability: The probability of reaching $G$ from a product state $s \notin G$ under a controller $u(\cdot)$ is denoted as $P_{s,u}(G)$. We construct a reward function $r : S \to \{1, 0\}$ such that $r(s) = 1_G(s)$ where $1_A(\cdot)$ is the indicator function, i.e., $1_A(x) = 1$ if $x \in A$, and $1_A(x) = 0$ otherwise. Then, $P_{s,u}(G)$ is evaluated by the value function

$$P_{s,u}(G) = W(s, u) = E_s^u\left[\int_0^T r(s(t))\,dt\right],$$

where $T$ is a random variable describing the stopping time such that $T = \inf\{t \ge 0 \mid s(t) \in G\}$.

The optimal value function is defined as $V(s) = \sup_{u \in \Pi} W(s, u)$ where $\Pi$ is the set of all admissible control policies for the SDE in (1).

So far, we have shown that given $x(\cdot), u(\cdot)$ that solve the SDE in (1), the probability that a sample path in the stochastic process $x(\cdot)$ satisfies the MITL formula $\varphi$ in the dense-time semantics can be represented by the value $W(s(0), u)$ in the derived product stochastic process $s(\cdot)$ under the reward function $r : S \to \{1, 0\}$.

B. Markov chain approximation 

In this section, we employ the methods in [13] to compute locally consistent Markov chains that approximate the SDE in (1) under a given control policy.

Given an approximating parameter $h$, referred to as the spatial step, we obtain a discretization of the bounded state space, denoted by $X^h$, which is a finite set of discrete points approximating $X$. Intuitively, the spatial step $h$ characterizes the distance between neighboring points and introduces a partition of $X$. The set of points in the same set of the partition is called an equivalence class. For each $x \in X^h$, the set of points in the same equivalence class as $x$ is denoted $[x] = \{x' \in X \mid x \le x' < x + h\}$. We call $x \in X^h$ the representative point of $[x]$.

We define an MDP $\mathcal{M}^h = (X^h, U, P^h, x_0)$ where $X^h$ is the discrete state space. $U$ is the input space, which can be infinite. $P^h : X^h \times U \times X^h \to [0, 1]$ is the transition probability function (defined later in this section). The initial state is $x_0 \in X^h$ such that the initial state $x(0)$ of the SDE in (1) satisfies $x(0) \in [x_0]$.

Definition 5: [13] Let $\Delta t_i^h$ be the interpolation interval at step $i$ for $i \ge 0$. Let $t_0^h = 0$ and $t_n^h = \sum_{i=0}^{n-1} \Delta t_i^h$ for $n \ge 1$ be interpolation times. The continuous interpolations $x^h(\cdot), u^h(\cdot)$ of the stochastic processes $\{x_n^h, n \in \mathbb{Z}_{\ge 0}\}$ and $\{u_n^h, n \in \mathbb{Z}_{\ge 0}\}$ under the interpolation times $\{t_n^h, n \in \mathbb{Z}_{\ge 0}\}$ are $x^h(t) = x_n^h$, $u^h(t) = u_n^h$, for all $t \in [t_n^h, t_{n+1}^h)$.

Given a policy $\{u_n^h, n \in \mathbb{Z}_{\ge 0}\}$, let $\{x_n^h, n \in \mathbb{Z}_{\ge 0}\}$ be the Markov chain induced from $\mathcal{M}^h$ by such a policy. It is shown that if a certain condition is satisfied by the spatial step and the interpolation times, the continuous interpolations of $\{x_n^h, n \in \mathbb{Z}_{\ge 0}\}$ and $\{u_n^h, n \in \mathbb{Z}_{\ge 0}\}$ converge to processes $x(\cdot)$ and $u(\cdot)$ which solve the SDE in (1).

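Theorem 1: [13] Suppose Assumption 2 holds. For any policy $\{u_n^h, n \in \mathbb{Z}_{\ge 0}\}$, let the chain induced from $\mathcal{M}^h$ by this policy be $\{x_n^h, n \in \mathbb{Z}_{\ge 0}\}$. Let $E^h_{x,n,a}$ denote the conditional expectation given $\{x_i^h, u_i^h, 0 \le i \le n, x_n^h = x, u_n^h = a\}$. Then, for all $x \in X$ and $a \in U$, the chain $\{x_n^h, n \in \mathbb{Z}_{\ge 0}\}$ satisfies the local consistency condition:

$$E^h_{x,n,a}[\Delta x_n^h] = f(x, a)\,\Delta t^h(x, a) + o(\Delta t^h(x, a)),$$

$$E^h_{x,n,a}\left([\Delta x_n^h - E^h_{x,n,a}\Delta x_n^h][\Delta x_n^h - E^h_{x,n,a}\Delta x_n^h]'\right) = g(x)g'(x)\,\Delta t^h(x, a) + o(\Delta t^h(x, a)),$$

$$\sup_{n, \omega} \|\Delta x_n^h\|_2 \xrightarrow{h \to 0} 0,$$

where $\Delta x_n^h = x_{n+1}^h - x_n^h$ is the difference and $\Delta t^h(x, a)$ is an appropriate interpolation interval for $x \in X$ and $a \in U$. As $h \to 0$, the continuous interpolations $x^h(\cdot), u^h(\cdot)$ of $\{x_n^h, n \in \mathbb{Z}_{\ge 0}\}$ and $\{u_n^h, n \in \mathbb{Z}_{\ge 0}\}$ under the interpolation times $\{t_n^h, n \in \mathbb{Z}_{\ge 0}\}$ computed from the interpolation intervals $\Delta t_n^h = \Delta t^h(x_n^h, u_n^h)$, $n \in \mathbb{Z}_{\ge 0}$, converge in distribution to $x(\cdot), u(\cdot)$ which solve the SDE in (1).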

Given a spatial step $h$, under the local consistency condition we construct the MDP $\mathcal{M}^h$ over the discrete state space $X^h$ by computing the transition probability function $P^h$ from the parameters of the SDE (see [13] for the details). If the diffusion matrix $g(x)g(x)'$ is diagonal, then the transition probabilities are

$$P^h(x, a, x \pm h_i e_i) = \Delta t(x, a) \cdot \left[\frac{(g(x)g'(x))_{ii}}{2h_i^2} + \frac{f_i^{\pm}(x, a)}{h_i}\right],$$

and

$$P^h(x, a, x) = 1 - \Delta t(x, a) \cdot \sum_{i=1}^{n} \left[\frac{(g(x)g'(x))_{ii}}{h_i^2} + \frac{|f_i(x, a)|}{h_i}\right],$$

where $e_i$ is the unit vector in the $i$-th direction and $f_i^{\pm}(x, a) = \max(\pm f_i(x, a), 0)$.
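A minimal sketch of these transition probabilities for a diagonal diffusion matrix follows. It assumes the standard form from the Markov chain approximation literature: a move of $\pm h_i$ in direction $i$ has probability $\Delta t \, [(gg')_{ii} / (2h_i^2) + f_i^{\pm} / h_i]$ and the remaining mass stays at $x$. The function name `mca_transitions` and its arguments are hypothetical, and grid boundaries are not handled.

```python
def mca_transitions(x, a, f, g_diag_sq, h, dt):
    """Return {successor: probability} for grid point x under input a.

    x is a tuple, f(x, a) returns the drift components, g_diag_sq(x)
    returns the diagonal entries (g g')_ii, h is the list of spatial
    steps per dimension, and dt is the interpolation interval.
    """
    drift = f(x, a)
    diff = g_diag_sq(x)
    probs = {}
    stay = 1.0
    for i in range(len(x)):
        for sign in (+1, -1):
            f_pm = max(sign * drift[i], 0.0)  # positive or negative part of f_i
            p = dt * (diff[i] / (2.0 * h[i] ** 2) + f_pm / h[i])
            succ = tuple(
                xj + sign * h[i] if j == i else xj for j, xj in enumerate(x)
            )
            probs[succ] = p
            stay -= p
    if stay < -1e-12:
        raise ValueError("dt is too large for the given spatial steps")
    probs[tuple(x)] = max(stay, 0.0)
    return probs


# One-dimensional check with drift 1, (g g')_11 = 0.25, h = 0.5, dt = 0.1.
probs = mca_transitions((0.0,), 1.0, lambda x, a: [a], lambda x: [0.25], [0.5], 0.1)
mean_step = sum(p * (y[0] - 0.0) for y, p in probs.items())
```

In this check the probabilities sum to one and the expected increment equals $f \, \Delta t$, which is the first local consistency condition.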


C. Optimal planning with the discrete approximation 

In this section, we construct a product MDP from a discrete approximation of the original system and the timed automaton expressing the system specification. Then, an optimal planning problem is formulated in the product MDP for computing a near-optimal policy for the SDE in (1) with respect to the probability of satisfying the MITL specification in the point-based semantics.

Given the timing constraints in MITL, we consider an explicit approximation method that discretizes both the continuous state space and time. In particular, instead of computing potentially varying interpolation intervals, we choose a constant interpolation interval $\delta$, referred to as the time step. For the local consistency condition to hold, it is required that for a given $h \in \mathbb{R}^n$,

$$\delta \le \left(\sum_{i=1}^{n} \left[\frac{(g(x)g'(x))_{ii}}{h_i^2} + \frac{|f_i(x, a)|}{h_i}\right]\right)^{-1}, \quad \forall x \in X, \forall a \in U. \qquad (3)$$


Furthermore, $\delta$ is used as the parameter to discretize the clock vector space $V$. Let $V_i^\delta = \{k\delta \mid 0 \le k \le \ldots\}$ be the discretized space for the range $V_i$ of clock $c_i \in C$. The discretized clock vector space is $V^\delta = \prod_{i = 1, \ldots, M} V_i^\delta$. Since both $X$ and $V$ are bounded, sets $X^h$ and $V^\delta$ are both finite. The method is "explicit" given the fact that the advance of clock values is explicit: At each step $n$, if the clock is not reset to 0, then its value is increased by the interpolation interval $\delta$.
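The largest admissible constant time step can be computed by a direct search over grid points and inputs. The sketch below reads the bound on $\delta$ as the reciprocal of $\sum_i [(gg')_{ii} / h_i^2 + |f_i| / h_i]$, maximized over states and inputs, which is the condition under which the self-transition probability stays non-negative; this reading is an assumption about the garbled source formula. The name `max_time_step` is hypothetical, and the input space is assumed to have been replaced by a finite list of actions.

```python
def max_time_step(grid_points, actions, f, g_diag_sq, h):
    """Largest delta with delta * rate(x, a) <= 1 for all sampled (x, a).

    rate(x, a) = sum_i (g g')_ii / h_i**2 + |f_i(x, a)| / h_i.
    """
    worst = 0.0
    for x in grid_points:
        diff = g_diag_sq(x)
        for a in actions:
            drift = f(x, a)
            rate = sum(
                diff[i] / h[i] ** 2 + abs(drift[i]) / h[i] for i in range(len(h))
            )
            worst = max(worst, rate)
    if worst == 0.0:
        raise ValueError("zero drift and diffusion: no finite bound on delta")
    return 1.0 / worst


# One-dimensional check: (g g')_11 = 0.25, |f| = 1, h = 0.5 gives rate 3.
delta_max = max_time_step(
    [(0.0,), (0.5,)], [-1.0, 1.0], lambda x, a: [a], lambda x: [0.25], [0.5]
)
```

Since the bound scales roughly with $h^2$ when the diffusion term dominates, refining the spatial grid forces a correspondingly finer time step.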


Let $d = (h, \delta)$ denote a tuple of spatial and time steps. Next, we construct a product MDP $\mathcal{M}^d = \mathcal{M}^h \times \mathcal{A}_\varphi = (S^d, U, P^d, s_0)$ where $S^d = X^h \times Q \times V^\delta$ is the discrete product state space, $U$ is the input space, and $P^d : S^d \times U \times S^d \to [0, 1]$ is the transition probability function, defined as follows. Let $s = (x, q, v)$ and $s' = (x', q', v')$. For any $a \in U$, $P^d(s, a, s') = P^h(x, a, x')$ if and only if $(q, v) \xrightarrow{\delta, L(x')} (q', v')$. Otherwise $P^d(s, a, s') = 0$. The initial state is $s_0 = (x_0, q_0, \mathbf{0})$ with $(\mathrm{Init}, \mathbf{0}) \xrightarrow{0, L(x_0)} (q_0, \mathbf{0})$.
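The product transition function can be sketched as a composition of the plant transitions with one step of the automaton. All names here are hypothetical: `plant_transitions(x, a)` returns the successor distribution of the discretized system, `automaton_step(q, v, delta, symbol)` returns the next automaton state and clock vector or `None` when no transition is enabled, and `label` is the labeling function. When the automaton has no enabled transition, the successor receives probability 0, following the definition above.

```python
def product_transitions(s, a, plant_transitions, automaton_step, label, delta):
    """Return {s': probability} for product state s = (x, q, v) and input a."""
    x, q, v = s
    out = {}
    for x_next, p in plant_transitions(x, a).items():
        # The automaton reads the label of the successor state after delta.
        step = automaton_step(q, v, delta, label(x_next))
        if step is None:
            continue  # no automaton transition: probability 0 by definition
        q_next, v_next = step
        s_next = (x_next, q_next, v_next)
        out[s_next] = out.get(s_next, 0.0) + p
    return out


# Toy instance: integer grid, region R1 = {x >= 2}, one clock, delta = 1.
def toy_plant(x, a):
    return {x: 0.5, x + 1: 0.5}


def toy_label(x):
    return frozenset({"R1"}) if x >= 2 else frozenset()


def toy_automaton(q, v, delta, symbol):
    if q == "q1":
        return ("q1", v)
    if "R1" in symbol:
        return ("q1", (0,))  # reach the accepting state and reset the clock
    return ("Init", (v[0] + delta,))


succ = product_transitions((1, "Init", (0,)), None, toy_plant, toy_automaton, toy_label, 1)
```

In the toy instance, half of the probability mass stays in `Init` with the clock advanced by one step and the other half reaches `q1` with the clock reset.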

Assumption 3: There exists a spatial step $h \in \mathbb{R}^n$ and a choice of representative points from $X$ such that for all $x \in X^h$ and all $x' \in [x]$, $L(x') = L(x)$.

Lemma 2: Under Assumption 3, given $x(\cdot), u(\cdot)$ that solve the SDE in (1) and a discretization $X^h$ of the state space, we construct a discrete chain $\{S_n, n \in \mathbb{Z}_{\ge 0}\}$ as follows: $S_0 = s_0 = (x_0, q_0, \mathbf{0})$ with $x(0) \in [x_0]$ and $(\mathrm{Init}, \mathbf{0}) \xrightarrow{0, L(x_0)} (q_0, \mathbf{0})$; for all $n \in \mathbb{N}$, $S_n = (x_n^h, q_n, v_n)$ where $x_n^h \in X^h$ is the representative point to which $x(n\delta)$ belongs, i.e., $x(n\delta) \in [x_n^h]$, and $(q_n, v_n) \xrightarrow{\delta, L(x_{n+1}^h)} (q_{n+1}, v_{n+1})$. The following two statements hold.

1) For all $n \in \mathbb{Z}_{\ge 0}$, the range of the random variable $S_n$ is $S^d$.

2) The probability of a continuous sample path in $\{x(t), t \ge 0\}$ satisfying $\varphi$ in the point-based semantics under the sampling interval $\delta$ equals the probability of a discrete sample path in the chain $\{S_n, n \in \mathbb{Z}_{\ge 0}\}$ hitting the set $G^d = X^h \times F \times V^\delta$. That is,

$$P_{x(0)}\left([L(x(\cdot))]^\delta \models \varphi\right) = P_{s_0}\left(S_k \in G^d \text{ and } \forall j < k,\ S_j \notin G^d\right).$$
Proof: To show the first statement, initially, $v(0) = \mathbf{0}$ is a vector of zeros, which is in $V^\delta$. Suppose that at the $n$-th sampling step $v(n\delta) \in V^\delta$. At the next sampling step, for any clock $c_i \in C$, either the value of $c_i$ is increased by $\delta$ or it is reset to 0, depending on the current state in the automaton, the clock vector and the current labeling of the state in $X$. If the value of $c_i$ is reset to 0, $v((n+1)\delta)[i] = 0 \in V_i^\delta$. Otherwise, $v((n+1)\delta)[i] = v(n\delta)[i] + \delta \in V_i^\delta$. By induction, all possible clock vectors we can encounter at the sampling times are in the set $V^\delta$. Thus, the range of a random variable $S_n$ for any $n \in \mathbb{Z}_{\ge 0}$ is $S^d$, which is a subset of $X \times Q \times V$.

Let $x(\cdot, \omega)$ with $\omega \in \Omega$ be a sample path of the process $x(\cdot)$ and $\rho = s_0 s_1 \ldots$ be the corresponding sample path of the chain $\{S_n, n \in \mathbb{Z}_{\ge 0}\}$ given the construction method above. Recall that $x(\cdot, \omega)$ satisfies $\varphi$ in the point-based semantics under the sampling interval $\delta$ if and only if the timed word $T(b^\delta)$, where $b(\cdot) = L(x(\cdot, \omega))$, is accepted in $\mathcal{A}_\varphi$. Since $x(i\delta, \omega) \in [\pi_1(s_i)]$ for all $i \ge 0$, let $\tau = 0\ \delta\ 2\delta \ldots$; we have that the timed word $T(b^\delta) = (\tau, L(\pi_1(\rho)))$ by Assumption 3. Let the run on the timed word $(\tau, L(\pi_1(\rho)))$ be $\varrho$ such that $\varrho = (\mathrm{Init}, \mathbf{0}) \xrightarrow{0, L(\pi_1(s_0))} (q_0, v_0) \xrightarrow{\delta, L(\pi_1(s_1))} (q_1, v_1) \ldots$. By construction, it holds that $q_i = \pi_2(s_i)$ and $v_i = \pi_3(s_i)$ for all $i \in \mathbb{Z}_{\ge 0}$. By definition of the acceptance condition in $\mathcal{A}_\varphi$, $(\tau, L(\pi_1(\rho)))$ is accepted in $\mathcal{A}_\varphi$ if and only if $\mathrm{Occ}(\varrho) \cap F \ne \emptyset$, which is equivalent to saying that for some $k \in \mathbb{Z}_{\ge 0}$, $s_k \in X \times F \times V$ and for all $j < k$, $s_j \notin X \times F \times V$. Since in the first statement we have shown that the range of $S_n$ for all $n \in \mathbb{Z}_{\ge 0}$ is $S^d$, $s_k \in X^h \times F \times V^\delta = G^d$ and the proof for the second statement is complete. ■

Lemma 2 characterizes the probability of satisfying the specification in the point-based semantics under the sampling interval $\delta$ with the probability of reaching a set $G^d$ in the product MDP $\mathcal{M}^d$. Given the objective of maximizing the probability of reaching a set in an MDP, there exists a memoryless and deterministic policy such that by following this policy, from any state, the probability of reaching the set is maximized [19].
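One common way to compute such a policy in a finite MDP is value iteration on the reachability probabilities. The sketch below is a generic illustration, not the specific algorithm of this paper: it assumes a finite action set, fixes the value of goal states at 1, and iterates from zero elsewhere. The names `max_reach_value_iteration` and `trans` are hypothetical.

```python
def max_reach_value_iteration(states, actions, trans, goal, tol=1e-9, max_iter=10000):
    """Maximal probability of reaching goal, with a greedy memoryless policy.

    trans(s, a) returns {successor: probability}.
    """
    V = {s: (1.0 if s in goal else 0.0) for s in states}

    def q_value(s, a):
        return sum(p * V[t] for t, p in trans(s, a).items())

    for _ in range(max_iter):
        diff = 0.0
        for s in states:
            if s in goal:
                continue  # goal states keep value 1
            best = max(q_value(s, a) for a in actions)
            diff = max(diff, abs(best - V[s]))
            V[s] = best
        if diff < tol:
            break
    policy = {
        s: max(actions, key=lambda a, s=s: q_value(s, a))
        for s in states
        if s not in goal
    }
    return V, policy


# Toy MDP: state 2 is the goal and state 3 is an absorbing failure state.
def toy_trans(s, a):
    if s == 0:
        return {1: 1.0} if a == "safe" else {2: 0.5, 3: 0.5}
    if s == 1:
        return {2: 0.9, 3: 0.1}
    return {s: 1.0}


V, policy = max_reach_value_iteration([0, 1, 2, 3], ["safe", "risky"], toy_trans, {2})
```

In the toy MDP the `safe` action at state 0 is preferred because it reaches the goal with probability 0.9 rather than 0.5.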

We introduce a state $\mathsf{sink}$ into the product MDP $\mathcal{M}^d$ and modify $P^d$ such that for all $s \in G^d$ and all $a \in U$, $P^d(s, a, \mathsf{sink}) = 1$, and for all $a \in U$, $P^d(\mathsf{sink}, a, \mathsf{sink}) = 1$, while the other transition probabilities remain unchanged. The product MDP $\mathcal{M}^d$ with the augmented state set and the modified transition probability function is denoted $\tilde{\mathcal{M}}^d$. The reward function $R : S^d \cup \{\mathsf{sink}\} \to \mathbb{R}$ is defined by $R(s) = 1_{G^d}(s)$. Let $u : S^d \cup \{\mathsf{sink}\} \to U$ be a memoryless and deterministic policy in $\tilde{\mathcal{M}}^d$ and $\tilde{\Pi}^d$ the set of all such policies in $\tilde{\mathcal{M}}^d$. The value function of policy $u$ is

$$W^d(s, u) = E_s^u\left[\sum_{i=0}^{\infty} R(S_i)\right],$$

where $\{S_n, n \in \mathbb{Z}_{\ge 0}\}$ is the Markov chain induced from $\tilde{\mathcal{M}}^d$ with policy $u$. Thus, the optimal value function is $V^d(s) = \max_{u \in \tilde{\Pi}^d} W^d(s, u)$, and the dynamic programming equation is obtained: For $s \in S^d$,

$$V^d(s) = R(s) + \max_{a \in U} \sum_{s' \in S^d \cup \{\mathsf{sink}\}} P^d(s, a, s')\,V^d(s'),$$

and $V^d(\mathsf{sink}) = 0$.

Given the optimal policy $\tilde{u}^* : S^d \cup \{\mathsf{sink}\} \to U$ that achieves the maximum value of $V^d(s)$ for all $s \in S^d$ in the modified product MDP $\tilde{\mathcal{M}}^d$, we derive a policy $u^* : S^d \to U$ by letting $u^*(s) = \tilde{u}^*(s)$. By the definition of the reward function and the modified product MDP, policy $u^*$ maximizes the probability of hitting the set $G^d$ in $\mathcal{M}^d$.

A policy $u : S^d \to U$ is implemented in the original system in (1) in the following manner. The initial product state is $(x(0), q_0, \mathbf{0})$ with $(\mathrm{Init}, \mathbf{0}) \xrightarrow{0, L(x(0))} (q_0, \mathbf{0})$. At each sampling time $n\delta$, $n \in \mathbb{Z}_{\ge 0}$, let the current product state be $(x, q, v) \in S$. We compute $(x^h, q, v) \in S$ such that $x \in [x^h]$. Note that at the sampling time the clock vector is always in $V^\delta$ by Lemma 2 and thus $(x^h, q, v) \in S^d$, for which $u$ is defined. Then, we apply a constant input $u((x^h, q, v))$ during the time interval $[n\delta, (n+1)\delta)$. At the next sampling time $(n+1)\delta$, according to the current state $x'$, we compute the state in $\mathcal{A}_\varphi$ and the clock vector such that $(q, v) \xrightarrow{\delta, L(x')} (q', v')$. Hence, the new product state is $(x', q', v')$ and a constant control input for the interval $[(n+1)\delta, (n+2)\delta)$ is obtained in the way we just described.

Remark 1: When the input space $U$ for the product MDP is bounded, in the numerical method for the reward maximization problem in the product MDP, in general we also discretize the input space $U$ with some discretization parameter $\epsilon$. Let $U^\epsilon$ be the discretized input space. Given the optimal policy $u^*$ for the product MDP $\mathcal{M}^d$ and the optimal policy $u^{\epsilon,*}$ in the product MDP with the input space $U^\epsilon$, one can derive the bound on $|W^d(s_0, u^*) - W^d(s_0, u^{\epsilon,*})|$ as a function of $\epsilon$, which converges to 0 as $\epsilon \to 0$ [13], [20]. Thus, with both discretized state and input space, the implemented policy is near-optimal for the SDE in (1) with respect to the probability of satisfying the MITL specification in the point-based semantics.

D. Proof of convergence

Based on Theorem 1, we show that the optimal policy synthesized in the product MDP converges to the optimal policy that achieves the maximal probability of satisfying the MITL specification in the dense-time semantics as the discretization in both state space and time gets finer.

Theorem 2: Given a discretization parameter $d = (h, \delta)$ where $\delta$ satisfies the local consistency condition in (3) with respect to the spatial step $h$, it holds that

$$\lim_{h \to 0} V^d(s_0) = V(s(0)),$$

where $s_0 = (x_0, q_0, \mathbf{0})$, $s(0) = (x(0), q_0, \mathbf{0})$ and $x(0) \in [x_0]$.

Proof: First, note that by the local consistency condition and the constraint on $\delta$ in (3), the admissible time step $\delta$ shrinks with $h$, and $\delta \to 0$ as $h \to 0$.

For a given $d = (h, \delta)$ where $\delta$ satisfies the constraint in (3) with respect to $h$, let $u : S^d \to U$ be a policy in the product MDP $\mathcal{M}^d$ and $\{S_n^d, n \in \mathbb{Z}_{\ge 0}\}$ be the induced Markov chain. According to Theorem 1, when $h \to 0$, $\delta \to 0$ and the continuous interpolations of $\{\pi_1(S_n^d), n \in \mathbb{Z}_{\ge 0}\}$ and $\{u_n^d = u(S_n^d), n \in \mathbb{Z}_{\ge 0}\}$ converge in distribution to $x(\cdot)$ and $u(\cdot)$ that solve the SDE in (1). By the determinism in the transition function of the timed automaton and the labeling function, as $h \to 0$, $\{S_n^d, n \in \mathbb{Z}_{\ge 0}\}$ also converges in distribution to $\{s(t), t \ge 0\}$, which is the product stochastic process derived from $\{x(t), t \ge 0\}$. According to the definition of reward functions $r$ and $R$, since $h \to 0$, we have $\delta \to 0$, $x_0 \to x(0)$ and $W^d(s_0, \{u_n^d, n \in \mathbb{Z}_{\ge 0}\})$ converges to $W(s(0), \{u(t), t \ge 0\})$.

Now given the optimal control policy $u^{d,*} : S^d \to U$ obtained for the product MDP $\mathcal{M}^d$, let $\{S_n, n \in \mathbb{Z}_{\ge 0}\}$ be the Markov chain induced by $u^{d,*}$ in $\mathcal{M}^d$. We have that $\lim_{h \to 0} W^d(s_0, \{u^{d,*}(S_n), n \in \mathbb{Z}_{\ge 0}\}) = \limsup_{h \to 0} V^d(s_0) \le V(s(0))$ by the optimality of the value function $V(s(0))$. On the other hand, let $u^* = \arg\sup_{u \in \Pi} W(s(0), u)$ be the optimal control policy in the continuous-time stochastic system. We construct a policy $\{u^*(n\delta), n \in \mathbb{Z}_{\ge 0}\}$ for the product MDP $\mathcal{M}^d$ such that the action $u^*(n\delta)$ is taken at step $n$. We have $V^d(s_0) \ge W^d(s_0, \{u^*(n\delta), n \in \mathbb{Z}_{\ge 0}\})$ by the optimality of $V^d(s_0)$. Since $\lim_{h \to 0} W^d(s_0, \{u^*(n\delta), n \in \mathbb{Z}_{\ge 0}\}) = W(s(0), \{u^*(t), t \ge 0\}) = V(s(0))$, it is inferred that $\liminf_{h \to 0} V^d(s_0) \ge V(s(0))$. Therefore, $\lim_{h \to 0} V^d(s_0) = V(s(0))$. ■


IV. Example


This section illustrates the method using a motion planning
example for a robot modeled as a stochastic Dubins car.
The dynamics of the system are described by the SDE


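$$\underbrace{\begin{bmatrix} dx(t) \\ dy(t) \\ d\theta(t) \end{bmatrix}}_{d\mathbf{x}(t)} = \underbrace{\begin{bmatrix} v(t)\cos\theta(t) \\ v(t)\sin\theta(t) \\ u(t) \end{bmatrix} dt}_{f(\mathbf{x}(t),\,u(t))\,dt} + g(\mathbf{x}(t))\,dw,$$

where $\mathbf{x} = (x, y, \theta)$ collects the coordinates and the
heading angle of the robot, $v$ is the linear velocity and
$u \in U = [-1, 1]$ is the angular velocity input. In this example,
$v = 1$ is fixed, $g(\mathbf{x}(t)) = 0.5 I_3$, and $w(\cdot)$ is a
3-dimensional Wiener process on the probability space $(\Omega, P)$.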
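
As an illustration of the dynamics above, the following is a minimal
Euler-Maruyama sketch in Python. It is not the authors' MATLAB
implementation. The function name `simulate_dubins`, the callable
`policy`, the integration step `dt` and the default horizon are
assumptions made for this example; the input is held constant over
each sampling interval `hold`, and wall collisions are not handled.

```python
import math
import random


def simulate_dubins(x0, policy, horizon=6.0, hold=0.2, dt=0.01,
                    v=1.0, sigma=0.5, seed=None):
    """Euler-Maruyama simulation of dX = f(X, u) dt + sigma * I_3 dW.

    x0     : initial state (x, y, theta)
    policy : hypothetical callable mapping a state tuple to an input
    hold   : sampling interval over which the input is held constant
    dt     : integration step (an assumption, not taken from the paper)
    """
    rng = random.Random(seed)
    steps_per_hold = max(1, round(hold / dt))
    n_holds = round(horizon / hold)
    x, y, theta = x0
    t = 0.0
    path = [(t, x, y, theta)]
    for _ in range(n_holds):
        # Clamp the input to the admissible set U = [-1, 1].
        u = max(-1.0, min(1.0, policy((x, y, theta))))
        for _ in range(steps_per_hold):
            # Independent Wiener increments, each distributed N(0, dt).
            dwx, dwy, dwt = (rng.gauss(0.0, math.sqrt(dt)) for _ in range(3))
            x, y, theta = (
                x + v * math.cos(theta) * dt + sigma * dwx,
                y + v * math.sin(theta) * dt + sigma * dwy,
                theta + u * dt + sigma * dwt,
            )
            t += dt
            path.append((t, x, y, theta))
    return path


# Example: one noisy sample path under a constant turning input.
sample_path = simulate_dubins((0.5, 0.5, 0.0), policy=lambda state: 0.5, seed=0)
```

Setting `sigma=0.0` recovers the deterministic Dubins car, which is a
convenient sanity check before adding noise.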

The workspace of the robot, depicted in the accompanying figure,
contains two regions of importance, $P_1$ and $P_2$. The workspace is
constrained by the walls
$\{(x,y) \mid x \in \{0,5\},\ 0 \leq y \leq 5\} \cup \{(x,y) \mid 0 \leq x \leq 5,\ y \in \{0,5\}\}$.

The objective of the robot is to maximize the probability
of visiting region $P_1$ within the first 5 time units and,
after visiting $P_1$, reaching $P_2$ between the 3rd and 5th
time units, while avoiding hitting the walls. We define
atomic propositions $R_i$, $i = 1, 2$, each of which evaluates true
when the robot is in region $P_i$. An atomic proposition
HitWall evaluates true if the robot hits the surrounding
walls. The MITL formula describing the specification is
$\varphi = \Diamond_{(0,5]}\big((R_1 \wedge \neg \mathrm{HitWall}) \wedge \Diamond_{(3,5]}(R_2 \wedge \neg \mathrm{HitWall})\big)$.
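
To make the point-based reading of the specification concrete, the
sketch below checks a sampled trace against the formula, assuming it
reads "eventually within (0, 5], R1 holds without HitWall, and from
that instant R2 holds without HitWall after a delay in (3, 5]". The
function name `satisfies_phi`, the trace encoding and the proposition
strings are assumptions made for this illustration.

```python
from fractions import Fraction


def satisfies_phi(trace, delta=Fraction(1, 5)):
    """Check the specification on a sampled trace (point-based semantics).

    trace : list of sets of atomic propositions; trace[k] holds at time
            k * delta
    delta : sampling interval as an exact Fraction, which avoids float
            rounding at the interval endpoints
    """
    def holds(k, prop):
        return prop in trace[k] and "HitWall" not in trace[k]

    for i in range(len(trace)):
        t_i = i * delta
        # Outer eventually: the sample time must lie in (0, 5].
        if not (0 < t_i <= 5 and holds(i, "R1")):
            continue
        # Inner eventually: the delay measured from t_i must lie in (3, 5].
        for j in range(i + 1, len(trace)):
            if 3 < (j - i) * delta <= 5 and holds(j, "R2"):
                return True
    return False


# Example: R1 at t = 1.0 and R2 at t = 5.0 (delay 4.0), sampled every 0.2.
example = [set() for _ in range(31)]
example[5].add("R1")
example[25].add("R2")
result = satisfies_phi(example)
```

Only the sampled instants are inspected, which is exactly what
distinguishes the point-based semantics from the dense-time one.
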
Given an initial state $\mathbf{x}_0$, we want to find an optimal policy
that maximizes the probability of $\varphi$ being satisfied. We
select a spatial step $h = (0.5, 0.5, \pi/4)^T$ to obtain a uniform
discretization of the state space $X$. Given the choice of $h$,
the time step $\delta$ is chosen to be 0.2 time units for the local
consistency condition to hold for all state and control input
pairs. The number of states in the MDP is 1089 and
the number of product states in the modified product MDP
is 58809 (after trimming unreachable states). Recall
that value iteration is polynomial in the size of the MDP.
The implementation is in MATLAB® on a desktop
with an Intel(R) Core(TM) processor and 16 GB of memory.
The computation of the product MDP takes 18 minutes and
the value iteration converges after 50 iterations with a
pre-specified error tolerance of 0.01. Each iteration takes about
6 minutes. In the value iteration we also approximate the
input space $U$ with a finite set $U^\epsilon$, where $\epsilon = 0.2$ is the
discretization parameter for the input space.
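
The value iteration with an error tolerance mentioned above can be
sketched on a toy model. The code below is a minimal Python
illustration, not the authors' MATLAB code. It assumes a recursion of
the form V(s) = R(s) + max over actions of the expected next value,
with an absorbing `sink` of value 0, and all state names, actions and
probabilities are hypothetical.

```python
def value_iteration(states, actions, P, R, tol=0.01, max_iter=1000):
    """Iterate V(s) = R(s) + max_a sum_{s'} P[s][a][s'] * V(s').

    P[s][a] is a dict {next_state: probability}. The absorbing state
    "sink" has value 0 and is not listed in `states`.
    """
    V = {s: 0.0 for s in states}
    V["sink"] = 0.0
    policy = {}
    for iteration in range(1, max_iter + 1):
        new_V = {"sink": 0.0}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions[s]:
                q = sum(p * V[s2] for s2, p in P[s][a].items())
                if q > best_q:
                    best_a, best_q = a, q
            new_V[s] = R.get(s, 0.0) + best_q
            policy[s] = best_a
        # Stop once the largest change in value falls below the tolerance.
        gap = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if gap < tol:
            return V, policy, iteration
    return V, policy, max_iter


# Toy model (hypothetical): reward 1 is collected once in "goal",
# which then moves to "sink", so V(s) is the probability of reaching it.
states = ["a", "b", "goal"]
actions = {"a": ["direct", "detour"], "b": ["go"], "goal": ["stay"]}
P = {
    "a": {"direct": {"goal": 0.5, "sink": 0.5}, "detour": {"b": 1.0}},
    "b": {"go": {"goal": 0.8, "sink": 0.2}},
    "goal": {"stay": {"sink": 1.0}},
}
R = {"goal": 1.0}
V, policy, n_iter = value_iteration(states, actions, P, R)
```

In this toy model the value at `a` equals the larger of the two
reachability probabilities, so the computed policy prefers the detour
through `b`.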

Since the product state space of the example is 5-dimensional,
we plot the optimal value only for the states with the initial
heading angle $\theta = 0$, the initial state of the timed automaton
and the initial clock vector $\mathbf{0}$ in the accompanying figure.
Two further figures show the sample paths starting
from $\mathbf{x}_0 = (0.5, 0.5, 0)$ for a time interval $[0, 6]$ from
different perspectives. The optimal value $V^d(s)$ with
$s = ((0.5, 0.5, 0), \mathrm{Init}, \mathbf{0})$ is 0.54, which
approximates the maximal probability of satisfying $\varphi$ in the
point-based semantics under the sampling interval 0.2 for the system
with initial state $\mathbf{x}(0) = (0.5, 0.5, 0)$. In simulation, 11
out of 20 sample paths (marked in blue) satisfy
the specification in the point-based semantics.

The drawback of the explicit approach is scalability. In order
to compute a control policy with a finer approximation,
we need to reduce the spatial step $h$ as well as the time step
$\delta$ for the local consistency condition to hold. The product
state space becomes very large for a fine discretization. For
example, if $h$ is chosen to be $(0.2, 0.2, \pi/4)^T$, $\delta$ has to be
chosen below 0.1 time units, and even for this simple example
the product MDP has 608303 states after trimming. We
did not carry out the computation for this finer
discretization since it is very time consuming. We discuss
the limitation and possible solutions to the issue
of scalability in Section V.
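
The growth in grid size can be checked with a few lines of arithmetic.
The sketch below counts the points of a uniform grid under one
assumption that reproduces the reported 1089 MDP states: each
dimension includes both endpoints, with the workspace taken as
[0, 5] x [0, 5] and the heading angle as [0, 2*pi]. These bounds are an
inference from the reported number, not something stated in the text,
and the function names are hypothetical.

```python
import math


def grid_points(lower, upper, step):
    """Number of grid points on [lower, upper] with spacing `step`,
    counting both endpoints."""
    return round((upper - lower) / step) + 1


def mdp_states(h):
    """State count of a uniform grid over [0, 5] x [0, 5] x [0, 2*pi]
    (assumed bounds)."""
    hx, hy, htheta = h
    return (
        grid_points(0.0, 5.0, hx)
        * grid_points(0.0, 5.0, hy)
        * grid_points(0.0, 2 * math.pi, htheta)
    )


coarse = mdp_states((0.5, 0.5, math.pi / 4))  # 11 * 11 * 9
fine = mdp_states((0.2, 0.2, math.pi / 4))    # 26 * 26 * 9
```

Under this counting the state grid alone grows from 1089 to 6084
points. The product MDP grows faster (from 58809 to 608303 states after
trimming, roughly a factor of 10) because the smaller time step also
refines the discretized clock values.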

V. Conclusions and future work 

This paper proposes a numerical method based on the
Markov chain approximation method for stochastic optimal
control with respect to a subclass of quantitative metric temporal
logic specifications. We show that as the discretization
gets finer, the optimal control policy in the discrete abstract
system with respect to satisfying the MITL specification in
the point-based semantics converges to the optimal policy in
the original system with respect to the dense-time semantics
for satisfying the MITL formula. The approach can be easily
extended to bounded-time MTL formulas, including signal
temporal logic formulas. In future work, we aim to
investigate the error bounds introduced by the proposed
discrete approximation method. In addition, since
scalability is a critical issue in the explicit approximation
method, we will also investigate a solution approach based
on implicit approximation [13]. With the implicit approximation
method, we can potentially reduce the size of the discrete
abstract system by treating the clock vector as a state
variable, whose discretization parameters are pre-defined
and potentially different from the interpolation interval.
Parallel algorithms and distributed planning for large-scale
MDPs are also considered to handle the issue of scalability.
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